Abstract

We consider deep neural networks (formally equivalent to sum-product networks [19]), in which
the output of each node is a quadratic function of its inputs. Similar to other deep architectures, these
networks can compactly represent any function on a finite training set. The main goal of this paper is the
derivation of a provably efficient, layer-by-layer algorithm for training such networks, which we denote
as the Basis Learner. Unlike most, if not all, previous algorithms for training deep neural networks, our
algorithm comes with formal polynomial time convergence guarantees. Moreover, the algorithm is a
universal learner in the sense that the training error is guaranteed to decrease at every iteration, and can
eventually reach zero under mild conditions. We present practical implementations of this algorithm, as
well as preliminary but quite promising experimental results. We also compare our deep architecture to
other shallow architectures for learning polynomials, in particular kernel learning.

1 Introduction

One of the most significant recent developments in machine learning has been the resurgence of "deep
learning", usually in the form of artificial neural networks. These systems are based on a multi-layered
architecture, where the input goes through several transformations, with higher-level concepts derived from
lower-level ones. Thus, these systems are considered to be particularly suitable for hard AI tasks, such as
computer vision and language processing.

The history of such multi-layered systems is long and uneven. They were extensively studied
in the 80's and early 90's, but with mixed success, and were eventually displaced to a large extent by
shallow architectures such as the Support Vector Machine (SVM) and boosting algorithms. These shallow
architectures not only worked well in practice, but also came with provably correct and computationally
efficient training algorithms, requiring tuning of only a small number of parameters - thus allowing them to
be incorporated into standard software packages.

However, in recent years, a combination of algorithmic advancements, as well as increasing compu- 
tational power and data size, has led to a breakthrough in the effectiveness of neural networks, and deep 
learning systems have shown very impressive practical performance on a variety of domains (a few exam- 
ples include [ 16l [131 |20l |5l [H [171 |T3| as well as [4J and references therein). This has led to a resurgence of 
interest in such learning systems. 

Nevertheless, a major caveat of deep learning is - and always has been - its strong reliance on heuristic 
methods. Despite decades of research, there is no clear-cut guidance on how one should choose the architec- 
ture and size of the network, or the type of computations it performs. Even when these are chosen, training 
these networks involves non-convex optimization problems, which are often quite difficult. No worst-case 
guarantees are possible, and pulling it off successfully is still much of a black art, requiring specialized 
expertise and much manual work. 




In this paper, we propose a provable and efficient algorithm to build and train a deep network for super- 
vised learning. By this, we mean an algorithm that: 

• Constructs a deep architecture: One which crucially relies on its multi-layered structure, in order to 
compactly represent complex predictors. 

• Comes with formal guarantees: In particular, it provably runs in polynomial time. Since the algorithm 
is principled, it is readily amenable to theoretical analysis and study. Moreover, the algorithm does 
not rely on complicated heuristics, and is easy to implement. 

• The algorithm is a universal learner, in the sense that the training error is guaranteed to decrease as the 
network increases in size, ultimately reaching zero under mild conditions. Moreover, a single run of 
the algorithm constructs an entire curve of predictors trading off bias and variance (or in other words,
training error vs. generalization performance).

• In its basic idealized form, the algorithm is parameter-free. The network is grown incrementally, 
where each added layer decreases the bias while increasing the variance. The process can be stopped 
once satisfactory performance is obtained. The architectural details of the network are automatically 
determined by theory. We describe a more efficient variant of the algorithm, which requires specifying 
the maximal width of the network in advance. Optionally, one can do additional fine-tuning (as we 
describe later on), but our experimental results indicate that even this rough tuning is already sufficient 
to get good results. 

Another benefit of our algorithm is that it can work with any loss function (convex in the network's predic- 
tion), and is not constrained to particular losses common in the deep learning literature, such as the squared 
loss. 

The algorithm we present trains a particular type of deep learning system, where each computational
node computes a linear or quadratic function of its inputs. Thus, the predictors we learn are polynomial
functions over the input space (which we take here to be $\mathbb{R}^d$). The networks we learn are also a type of
sum-product networks, which have been introduced in the context of efficient representations of partition
functions [19, 9]. Thus, our work can also be seen as an efficient algorithm for training sum-product
networks in a supervised learning setting.

The derivation of our algorithm is inspired by ideas from [18], used there for a different purpose. At its
core, our method attempts to build a network which provides a good approximate basis for the values attained
by all polynomials of bounded degree over the training instances. Crucially, it utilizes the deep structure of
multi-layered networks in order to provide a compact representation. This is important, since the dimension
of the vector space of all polynomials of bounded degree in $\mathbb{R}^d$ is huge (exponential in the degree), and there
is no hope to build such a representation explicitly. Efficiently representing high-dimensional spaces can
also be achieved using kernels, but our approach is very different and has some important advantages which
we discuss later on. Similar to a well-known principle in modern deep learning, the layers of our network are
built one by one, creating higher and higher level representations of the data. Once such a representation is
built, a final output layer is constructed by solving a simple convex optimization problem.

The rest of the paper is structured as follows. In Sec. 2 we introduce notation. The heart of our paper is
Sec. 3, where we present our algorithm and analyze its properties. In Sec. 4 we discuss sample complexity
(generalization) issues. In Sec. 5 we compare our deep architecture for learning polynomials to the shallow
architecture obtained by kernel learning. In Sec. 6 we present preliminary experimental results.

Overall, we believe our work opens the door to the derivation of principled deep learning algorithms, 
with many possible avenues for future work. 



2 Preliminaries 

We use bold-face letters to denote vectors. In particular, $\mathbf{1}$ denotes the all-ones vector. For any two vectors
$\mathbf{g} = (g_1, \ldots, g_d)$, $\mathbf{h} = (h_1, \ldots, h_d)$, we let $\mathbf{g} \circ \mathbf{h}$ denote their Hadamard product, namely the vector
$(g_1 h_1, \ldots, g_d h_d)$. $\|\cdot\|$ refers to the Euclidean norm. $\mathrm{Ind}(\cdot)$ refers to the indicator function.

For two matrices $F, G$ with the same number of rows, we let $[F\ G]$ denote the new matrix formed by
concatenating the columns of $F, G$. For a matrix $F$, $F_{i,j}$ refers to the entry in row $i$ and column $j$; $F_j$ refers
to its $j$-th column; and $|F|$ refers to the number of columns.

We assume we are given labeled training data $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$, where each $\mathbf{x}_i$ is in $\mathbb{R}^d$, and
$y_i$ is a scalar label/target value. We let $X$ denote the matrix such that $X_{i,j} = x_{i,j}$, and $\mathbf{y}$ is the vector
$(y_1, \ldots, y_m)$. For simplicity of presentation, we will assume that $m > d$, but note that for most results this
can be easily relaxed.

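Given a vector of predicted values $\mathbf{v}$ on the training set (or a matrix $V$ in a multi-class prediction setting),
we use $\ell(\mathbf{v}, \mathbf{y})$ to denote the training error, which is assumed to be a convex function of $\mathbf{v}$. Some examples
include:

• Squared loss: $\ell(\mathbf{v}, \mathbf{y}) = \frac{1}{m} \|\mathbf{v} - \mathbf{y}\|^2$

• Hinge loss: $\ell(\mathbf{v}, \mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i v_i\}$

• Logistic loss: $\ell(\mathbf{v}, \mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} \log(1 + \exp(-y_i v_i))$

• Multiclass hinge loss: $\ell(V, \mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 + \max_{j \neq y_i} V_{i,j} - V_{i,y_i}\}$ (here, $V_{i,j}$ is the
confidence score for instance $i$ being in class $j$)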
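As a rough illustration of the losses listed above, the following NumPy sketch evaluates each of them on a vector (or matrix) of predictions. The function names are ours and not from the paper, the $1/m$ normalization is assumed, and labels are assumed to be $\pm 1$ for the binary losses and integer class indices (with at least two classes) for the multiclass loss.

```python
import numpy as np

def squared_loss(v, y):
    # (1/m) * ||v - y||^2
    return np.mean((v - y) ** 2)

def hinge_loss(v, y):
    # (1/m) * sum_i max{0, 1 - y_i v_i}, labels y_i in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y * v))

def logistic_loss(v, y):
    # (1/m) * sum_i log(1 + exp(-y_i v_i)); logaddexp avoids overflow
    return np.mean(np.logaddexp(0.0, -y * v))

def multiclass_hinge_loss(V, y):
    # V[i, j] is the score of class j on instance i; y[i] is the true class index.
    m = V.shape[0]
    correct = V[np.arange(m), y]
    others = V.astype(float)            # astype returns a copy
    others[np.arange(m), y] = -np.inf   # exclude the true class from the max
    return np.mean(np.maximum(0.0, 1.0 + others.max(axis=1) - correct))
```

All four functions are convex in the predictions, which is the only property the algorithm below relies on.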

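Moreover, in the context of linear predictors, we can consider regularized loss functions, where we augment
the loss by a regularization term such as $\frac{\lambda}{2} \|\mathbf{w}\|^2$ (where $\mathbf{w}$ is the linear predictor) for some parameter $\lambda > 0$.

Multivariate polynomials are functions over $\mathbb{R}^d$, of the form

$$p(\mathbf{x}) = \sum_{i=0}^{\Delta} \sum_{\boldsymbol{\alpha}^{(i)}} w_{\boldsymbol{\alpha}^{(i)}} \prod_{l=1}^{d} x_l^{\alpha^{(i)}_l}, \qquad (1)$$

where $\boldsymbol{\alpha}^{(i)}$ ranges over all $d$-dimensional vectors of non-negative integers such that $\sum_{l=1}^{d} \alpha^{(i)}_l = i$, and $\Delta$ is the
degree of the polynomial. Each term $\prod_{l=1}^{d} x_l^{\alpha^{(i)}_l}$ is a monomial of degree $i$.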

To represent our network, we let $n^i_j(\cdot)$ refer to the $j$-th node in the $i$-th layer, as a function of its inputs.
In our algorithm, the function each node computes is always either a linear function, or a weighted product
of two inputs:

$$(z_1, z_2) \mapsto w z_1 z_2,$$

where $w \in \mathbb{R}$. The depth of the network corresponds to the number of layers, and the width corresponds to
the largest number of nodes in any single layer.

3 The Basis Learner: Algorithm and Analysis 

We now turn to develop our Basis Learner algorithm, as well as the accompanying analysis. We do this in 
three stages: First, we derive a generic and idealized version of our algorithm, which runs in polynomial time 
but is not very practical; Second, we formally prove its efficiency in terms of time complexity, training error, 
and other theoretical properties; Third, we discuss and analyze a more realistic variant of our algorithm, 
which also enjoys good theoretical properties, generalizes better, and is more flexible in practice. 



3.1 Generic Algorithm 

Recall that our goal is to learn polynomial predictors, using a deep architecture, based on a training set
with instances $\mathbf{x}_1, \ldots, \mathbf{x}_m$. However, let us ignore for now the learning aspect and focus on a representation
problem: how can we build a network capable of representing the values of any polynomial over the
instances?

At first glance, this may seem like a tall order, since the space of all polynomials is not specified by
any bounded number of parameters. However, our first crucial observation is that we care (for now) only
about the values on the $m$ training instances. We can represent these values as $m$-dimensional vectors in
$\mathbb{R}^m$. Moreover, we can identify each polynomial $p$ with its values on the training instances, via the linear
projection

$$p \mapsto (p(\mathbf{x}_1), \ldots, p(\mathbf{x}_m)).$$

Since the space of all polynomials can attain any set of values on a finite set of distinct points [10], we
get that polynomials span $\mathbb{R}^m$ via this linear projection. By a standard result from linear algebra, this
immediately implies that there are $m$ polynomials $p_1, \ldots, p_m$, such that $\{(p_i(\mathbf{x}_1), \ldots, p_i(\mathbf{x}_m))\}_{i=1}^{m}$ form
a basis of $\mathbb{R}^m$ - we can write any set of values $(y_1, \ldots, y_m)$ as a linear combination of these. Formally, we
get the following:

Lemma 1. Suppose $\mathbf{x}_1, \ldots, \mathbf{x}_m$ are distinct. Then there exist $m$ polynomials $p_1, \ldots, p_m$, such that
$\{(p_i(\mathbf{x}_1), \ldots, p_i(\mathbf{x}_m))\}_{i=1}^{m}$ form a basis of $\mathbb{R}^m$.

Hence, for any set of values $(y_1, \ldots, y_m)$, there is a coefficient vector $(w_1, \ldots, w_m)$, so that
$\sum_{i=1}^{m} w_i p_i(\mathbf{x}_j) = y_j$ for all $j = 1, \ldots, m$.

This lemma implies that if we build a network which computes such $m$ polynomials $p_1, \ldots, p_m$, then
we can train a simple linear classifier on top of these outputs, which can attain any target values over the
training data.
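To make Lemma 1 concrete, here is a small numerical sketch for the one-dimensional case ($d = 1$), where the monomials $1, x, \ldots, x^{m-1}$ are one possible choice of the polynomials $p_1, \ldots, p_m$. This choice, and the specific points and targets, are our own illustrative assumptions; the lemma itself does not prescribe a particular basis.

```python
import numpy as np

# m = 4 distinct training points in one dimension (illustrative values).
x = np.array([-1.0, 0.0, 0.5, 2.0])
m = len(x)

# P[j, i] = p_i(x_j) with p_i(x) = x**i, i.e. a Vandermonde matrix.
P = np.vander(x, N=m, increasing=True)

# Arbitrary target values; solving P w = y gives the coefficient vector of the lemma.
y = np.array([3.0, -1.0, 0.0, 7.0])
w = np.linalg.solve(P, y)

rank = np.linalg.matrix_rank(P)   # equals m when the points are distinct
fit = P @ w                       # reproduces y up to rounding error
```

Because the points are distinct, the columns of `P` are linearly independent, so a linear predictor on top of these $m$ polynomial outputs can match any target vector.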

While it is nice to be able to express any target values $(y_1, \ldots, y_m)$ as a function of the input instances
$(\mathbf{x}_1, \ldots, \mathbf{x}_m)$, such an expressive machine will likely lead to overfitting. Our generic algorithm builds a deep
network such that the nodes of the first $r$ layers form a basis of all values attained by degree-$r$ polynomials.
Therefore, we start with a simple network, which might have a large bias but will tend not to overfit (i.e. low
variance), and as we make the network deeper and deeper we gradually decrease the bias while increasing
the variance. Thus, in principle, this algorithm can be used to train the natural curve of solutions that can be
used to control the bias-variance tradeoff.

It remains to describe how we build such a network. First, we show how to construct a basis which
spans all values attained by degree-1 polynomials (i.e. linear functions). We then show how to enlarge
this to a basis of all values attained by degree-2 polynomials, and so on. Each such enlargement of the
degree corresponds to another layer in our network. Later, we will prove that each step can be calculated in
polynomial time and the whole process terminates after a polynomial number of iterations.

While we do not need to make any special assumptions on the data structure, we note that our idealized
algorithm works particularly well when certain algebraic assumptions are met. As the simplest example,
consider the case where the data lies on an affine manifold. In that case, a small number of degree-1
polynomials span all linear functions. A thin first layer (which spans degree-1 polynomials) will not result
in a deterioration of the training error. This continues to deeper layers: if our data lies, for example, on the
unit sphere, then choosing thinner higher-level layers (that correspond to higher-degree polynomials) will
not affect the training error, as we simply remove exponentially many polynomials that vanish on the data
set.



3.1.1 Constructing the First Layer 

The set of values attained by degree-1 polynomials (linear functions) over the data is

$$\{(\langle \mathbf{w}, [1\ \mathbf{x}_1] \rangle, \ldots, \langle \mathbf{w}, [1\ \mathbf{x}_m] \rangle) : \mathbf{w} \in \mathbb{R}^{d+1}\}, \qquad (2)$$

which is a $(d+1)$-dimensional linear subspace of $\mathbb{R}^m$. Thus, to construct a basis for it, we only need to find
$d+1$ vectors $\mathbf{w}_1, \ldots, \mathbf{w}_{d+1}$, so that the set of vectors $\{(\langle \mathbf{w}_j, [1\ \mathbf{x}_1] \rangle, \ldots, \langle \mathbf{w}_j, [1\ \mathbf{x}_m] \rangle)\}_{j=1}^{d+1}$ are linearly
independent. This can be done in many ways. For example, one can construct an orthogonal basis to
Eq. (2), using Gram-Schmidt or SVD (equivalently, finding a $(d+1) \times (d+1)$ matrix $W$, so that $[\mathbf{1}\ X]W$
has orthogonal columns).¹ At this stage, our focus is to present our approach in full generality, so we avoid
fixing a specific basis-construction method.

Whatever basis-construction method we use, we end up with some linear transformation (specified by
a matrix $W$), which maps $[\mathbf{1}\ X]$ into the constructed basis. The columns of $W$ specify the $d+1$ linear
functions forming the first layer of our network: For all $j = 1, \ldots, d+1$, the $j$'th node of the first layer is
the function

$$n^1_j(\mathbf{x}) = \langle W_j, [1\ \mathbf{x}] \rangle,$$

and we have the property that $\{(n^1_j(\mathbf{x}_1), \ldots, n^1_j(\mathbf{x}_m))\}_{j=1}^{d+1}$ is a basis for all values attained by degree-1
polynomials over the training data. We let $F^1$ denote the $m \times (d+1)$ matrix² whose columns are the
vectors of this set, namely, $F^1_{i,j} = n^1_j(\mathbf{x}_i)$.
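One possible way to realize this first layer is via an SVD of the augmented matrix $[\mathbf{1}\ X]$. The sketch below is only an illustration under our own assumptions: the function name `build_first_layer`, the relative tolerance `tol`, and the rescaling of each column to unit second moment are choices made here, not requirements of the construction.

```python
import numpy as np

def build_first_layer(X, tol=1e-10):
    """Return (F1, W) with F1 = [1 X] @ W having orthogonal columns."""
    m = X.shape[0]
    A = np.hstack([np.ones((m, 1)), X])            # augmented matrix [1 X]
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    W = Vt.T[:, s > tol * s.max()]                 # drop directions with ~zero singular value
    F1 = A @ W                                     # columns are orthogonal (A W = L D)
    scale = np.sqrt(m) / np.linalg.norm(F1, axis=0)
    return F1 * scale, W * scale                   # each column has second moment 1

# Node j of the first layer is then x -> <W[:, j], [1 x]>.
```

If the data lies in a lower-dimensional affine subspace, some singular values vanish and the returned `F1` has fewer than $d+1$ columns, so the first layer becomes thinner without losing any linear function on the data.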

3.1.2 Constructing The Second Layer 

So far, we have a one-layer network whose outputs span all values attained by linear functions on the training
instances. In principle, we can use the same trick to find a basis for degree-2, 3, . . . polynomials: For any
degree $\Delta$ polynomial, consider the space of all values attained by such polynomials over the training data,
and find a spanning basis. However, we quickly run into a computational problem, since the space of all
degree $\Delta$ polynomials in $\mathbb{R}^d$ ($d > 1$) increases exponentially in $\Delta$, requiring us to consider exponentially
many vectors. Instead, we utilize our deep architecture to find a compact representation of the required
basis, using the following simple but important observation:

Lemma 2. Any degree $t$ polynomial can be written as

$$\sum_i g_i(\mathbf{x}) h_i(\mathbf{x}) + k(\mathbf{x}),$$

where $g_i(\mathbf{x})$ are degree-1 polynomials, $h_i(\mathbf{x})$ are degree-$(t-1)$ polynomials, and $k(\mathbf{x})$ is a polynomial of
degree at most $t-1$.

Proof. Any polynomial of degree $t$ can be written as a weighted sum of monomials of degree $t$, plus a
polynomial of degree $\le t-1$. Moreover, any monomial of degree $t$ can be written as a product of a
monomial of degree $t-1$ and a monomial of degree 1. Since $(t-1)$-degree monomials are in particular
$(t-1)$-degree polynomials, the result follows. □
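As a small worked instance of Lemma 2 (the specific polynomial is our own illustrative choice), take $t = 3$ and three variables. Each degree-3 monomial is split into a degree-1 factor times a degree-2 factor, and everything of lower degree is collected into $k$:

```latex
\begin{align*}
p(\mathbf{x}) &= x_1^2 x_2 + 3 x_1 x_2 x_3 + x_1 + 1 \\
              &= \underbrace{x_1}_{g_1(\mathbf{x})} \cdot \underbrace{x_1 x_2}_{h_1(\mathbf{x})}
               + \underbrace{3 x_1}_{g_2(\mathbf{x})} \cdot \underbrace{x_2 x_3}_{h_2(\mathbf{x})}
               + \underbrace{(x_1 + 1)}_{k(\mathbf{x})} .
\end{align*}
```

Here $g_1, g_2$ have degree 1, $h_1, h_2$ have degree 2, and $k$ has degree $1 \le t - 1$, as the lemma requires.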



¹ This is essentially the same as the first step of the VCA algorithm in [18]. Moreover, it is very similar to performing Principal
Component Analysis (PCA) on the data, which is often a standard first step in learning. It differs from PCA in that the SVD
is done on the augmented matrix $[\mathbf{1}\ X]$, rather than on a centered version of $X$. This is significant here, since the columns of a
centered data matrix $X$ cannot express the $\mathbf{1}$ vector, hence we cannot express the constant 1 polynomial on the data.

² If the data lies in a subspace of $\mathbb{R}^d$ then the number of columns of $F^1$ will be the dimension of this subspace plus 1.



The lemma implies that any degree-2 polynomial can be written as the sum of products of degree-1
polynomials, plus a degree-1 polynomial. Since the nodes at the first layer of our network span all degree-1
polynomials, they in particular span the polynomials $g_i, h_i, k$, so it follows that any degree-2 polynomial
can be written as

$$\sum_i g_i(\mathbf{x}) h_i(\mathbf{x}) + k(\mathbf{x}) = \sum_{j,r} n^1_j(\mathbf{x})\, n^1_r(\mathbf{x}) \Big( \sum_i \alpha^{(g_i)}_j \alpha^{(h_i)}_r \Big) + \sum_j n^1_j(\mathbf{x})\, \alpha^{(k)}_j,$$

where all the $\alpha$'s are scalars. In other words, the vector of values attainable by any degree-2 polynomial is
in the span of the vector of values attained by nodes in the first layer, and products of the outputs of every
two nodes in the first layer.

Let us now switch back to an algebraic representation. Recall that in constructing the first layer, we
formed a matrix $F^1$, whose columns span all values attainable by degree-1 polynomials. Writing $F := F^1$, the above
implies that the matrix $[F\ \tilde{F}^2]$, where

$$\tilde{F}^2 = \big[ (F^1_1 \circ F^1_1) \cdots (F^1_1 \circ F^1_{|F^1|}) \ \cdots\ (F^1_{|F^1|} \circ F^1_1) \cdots (F^1_{|F^1|} \circ F^1_{|F^1|}) \big],$$

spans all possible values attainable by degree-2 polynomials. Thus, to get a basis for the values attained by
degree-2 polynomials, it is enough to find some column subset $F^2$ of $\tilde{F}^2$, so that $[F\ F^2]$'s columns are a
linearly independent basis for $[F\ \tilde{F}^2]$'s columns. Again, this basis construction can be done in several ways,
using standard linear algebra (such as a Gram-Schmidt procedure or more stable alternative methods). The
columns of $F^2$ (which are a subset of the columns of $\tilde{F}^2$) specify the 2nd layer of our network: each such
column, which corresponds to (say) $F^1_i \circ F^1_j$, corresponds in turn to a node in the 2nd layer, which computes
the product of nodes $n^1_i(\cdot)$ and $n^1_j(\cdot)$ in the first layer. We now redefine $F$ to be the augmented matrix
$[F\ F^2]$.

3.1.3 Constructing Layers 3, 4, . . .

It is only left to repeat this process. At each iteration $t$, we maintain a matrix $F$, whose columns form a basis
for the values attained by all polynomials of degree $\le t-1$. We then consider the new matrix

$$\tilde{F}^t = \big[ (F^{t-1}_1 \circ F^1_1) \cdots (F^{t-1}_1 \circ F^1_{|F^1|}) \ \cdots\ (F^{t-1}_{|F^{t-1}|} \circ F^1_1) \cdots (F^{t-1}_{|F^{t-1}|} \circ F^1_{|F^1|}) \big],$$

and find a column subset $F^t$ so that the columns of $[F\ F^t]$ form a basis for the columns of $[F\ \tilde{F}^t]$. We then
redefine $F := [F\ F^t]$, and are assured that the columns of $F$ span the values of all polynomials of degree
$\le t$ over the data. By adding this newly constructed layer, we get a network whose outputs form a basis for
the values attained by all polynomials of degree $\le t$ over the training instances.
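The candidate matrix of pairwise Hadamard products can be formed with a single broadcasted multiplication. The following sketch is our own illustration on random data with $d = 2$, using the unnormalized basis $[\mathbf{1}\ X]$ for the first layer; it checks numerically that the first-layer columns together with their pairwise products span the values of all degree-2 polynomials (a space of dimension 6 for generic points in two dimensions).

```python
import numpy as np

def candidate_products(F_prev, F1):
    """Column (i * |F1| + j) holds the Hadamard product F_prev[:, i] * F1[:, j]."""
    m = F1.shape[0]
    return (F_prev[:, :, None] * F1[:, None, :]).reshape(m, -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
F1 = np.hstack([np.ones((30, 1)), X])      # spans the values of degree-1 polynomials
F2_tilde = candidate_products(F1, F1)      # 3 * 3 = 9 candidate columns

rank_deg1 = np.linalg.matrix_rank(F1)
rank_deg2 = np.linalg.matrix_rank(np.hstack([F1, F2_tilde]))
```

Only `rank_deg2 - rank_deg1` of the nine candidate columns are needed in the new layer; the remaining ones are linearly dependent on columns already present and are discarded by the basis-construction step.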

To maintain numerical stability, it may be desirable to multiply each column of $F^t$ by a normalization
factor, e.g. by scaling each column $F^t_i$ so that the second moment $\frac{1}{m} \|F^t_i\|^2$ across the column is 1 (otherwise,
the iterated products may make the values in the matrix very large or small). Overall, we can specify the
transformation from $\tilde{F}^t$ to $F^t$ via a matrix $W$ of size $|F^{t-1}| \times |F^1|$, so that for any $r = 1, \ldots, |F^t|$,

$$F^t_r := W_{i(r),j(r)} \, F^{t-1}_{i(r)} \circ F^1_{j(r)}.$$



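Initialize $F$ as an empty matrix, and $\tilde{F}^1 := [\mathbf{1}\ X]$
$(F^1, W^1)$ := BuildBasis$^1$($\tilde{F}^1$)
    // Columns of $F^1$ are linearly independent, and $F^1 = \tilde{F}^1 W^1$
Create first layer: $\forall i \in \{1, \ldots, |F^1|\}$, $n^1_i(\mathbf{x}) := \langle W^1_i, [1\ \mathbf{x}] \rangle$
$F := F^1$
For $t = 2, 3, \ldots$
    Create candidate output layer: $(n^{\text{output}}(\cdot), \text{error})$ := OutputLayer($F$)
    If error sufficiently small, break
    $\tilde{F}^t := \big[ (F^{t-1}_1 \circ F^1_1)\ (F^{t-1}_1 \circ F^1_2)\ \cdots\ (F^{t-1}_{|F^{t-1}|} \circ F^1_{|F^1|}) \big]$
    $(F^t, W^t)$ := BuildBasis$^t$($F$, $\tilde{F}^t$)
        // Columns of $[F\ F^t]$ are linearly independent
        // $W^t$ is such that $F^t_r = W^t_{i(r),j(r)} (F^{t-1}_{i(r)} \circ F^1_{j(r)})$
    If $|F^t| = 0$, break
    Create layer $t$: For each non-zero element $W^t_{i(r),j(r)}$ in $W^t$, $r = 1, \ldots, |F^t|$,
        $n^t_r(\cdot) := W^t_{i(r),j(r)} \, n^{t-1}_{i(r)}(\cdot) \, n^1_{j(r)}(\cdot)$
    $F := [F\ F^t]$

OutputLayer($F$)
    $\mathbf{w} := \arg\min_{\mathbf{w} \in \mathbb{R}^{|F|}} \ell(F\mathbf{w}, \mathbf{y})$
    $n^{\text{output}}(\cdot) := \langle \mathbf{w}, \mathbf{n}(\cdot) \rangle$, where $\mathbf{n}(\cdot) = \big( n^1_1(\cdot), n^1_2(\cdot), \ldots, n^{t-1}_{|F^{t-1}|}(\cdot) \big)$ consists of the outputs of all nodes in the network
    Let error be the error of $n^{\text{output}}(\cdot)$ on a validation data set
    Return $(n^{\text{output}}(\cdot), \text{error})$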



Figure 1: The Basis Learner algorithm. The top box is the main algorithm, which constructs the
network, and the bottom box is the output layer construction procedure. At this stage, BuildBasis$^1$ and
BuildBasis$^t$ are not fixed, but we provide one possible implementation in Figure 3.

As we will prove later on, if at any stage the subspace spanned by $[F\ \tilde{F}^t]$ is the same as the subspace
spanned by $F$ (that is, $|F^t| = 0$), then our network can span the values of all polynomials of any degree over the training
data, and we can stop the process.

The process (up to the creation of the output layer) is described in Figure 1, and the resulting network
architecture is shown in Figure 2. We note that the resulting network has a feedforward architecture. The
connections, though, are not only between adjacent layers, unlike many common deep learning architectures.
Moreover, although we refrain from fixing the basis creation methods at this stage, we provide one
possible implementation in Figure 3. We emphasize, though, that other variants are possible, and the basis
construction method can be different at different layers.



Constructing the Output Layer. After $\Delta - 1$ iterations (for some $\Delta$), we end up with a matrix $F$, whose
columns form a basis for all values attained by polynomials of degree $\le \Delta - 1$ over the training data. Moreover,
each column is exactly the values attained by some node in our network over the training instances.

Figure 2: Schematic diagram of the network's architecture, for polynomials of degree 4. Each element
represents a layer of nodes, as specified in Figure 1. (+) represents a layer of nodes which compute functions
of the form $n(\mathbf{z}) = \sum_i w_i z_i$, while other layers consist of nodes which compute functions of the form
$n(\mathbf{z}) = n((z_{i(1)}, z_{i(2)})) = w z_{i(1)} z_{i(2)}$. In the diagram, computation moves top to bottom and left to right.



On top of this network, we can now train a simple linear predictor $\mathbf{w}$, minimizing some convex loss function
$\mathbf{w} \mapsto \ell(F\mathbf{w}, \mathbf{y})$. This can be done using any convex optimization procedure. We are assured that for any
polynomial of degree at most $\Delta - 1$, there is some such linear predictor $\mathbf{w}$ which attains the same value as
this polynomial over the data. This linear predictor forms the output layer of our network.
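For the squared loss, the output-layer problem is an ordinary least-squares fit, which can be sketched as follows. The function name and the toy data are our own assumptions, and the sketch reports the training error only, whereas the procedure in Figure 1 measures the error on a validation set. Other convex losses would require a generic convex solver in place of `np.linalg.lstsq`.

```python
import numpy as np

def output_layer_squared_loss(F, y):
    """Least-squares output weights w minimizing (1/m) * ||F w - y||^2."""
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    train_error = np.mean((F @ w - y) ** 2)
    return w, train_error

# Toy check: columns of F are the values of 1, x, x^2 on four points,
# and the targets come from the degree-2 polynomial 1 + 2 x^2.
x = np.array([0.0, 1.0, 2.0, 3.0])
F = np.column_stack([np.ones_like(x), x, x ** 2])
y = 1.0 + 2.0 * x ** 2
w, train_error = output_layer_squared_loss(F, y)
```

Since the target is a polynomial whose values lie in the span of the columns of `F`, the fitted predictor reproduces it on the data, in line with the guarantee stated above.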

As mentioned earlier, the inspiration for our approach is based on [18], which presents an incremental
method to efficiently build a basis for polynomial functions. In particular, we use the same basic ideas in
order to ensure that after $t$ iterations, the resulting basis spans all polynomials of degree at most $t$. While we
owe a lot to their ideas, we should also emphasize the differences: First, the emphasis there is to find a set
of generators for the ideal of polynomials vanishing on the training set. Second, their goal there has nothing
to do with deep learning, and the result of the algorithm is a basis rather than a deep network. Third, they
build the basis in a different way than ours (forcing orthogonality of the basis components), which does not
seem as effective in our context (see end of Sec. 6). Fourth, the practical variant of our algorithm,
which is described further on, is very different from the methods used in [18].

Before continuing with the analysis, we make several important remarks: 

Remark 1 (Number of layers does not need to be fixed in advance). Each iteration of the algorithm
corresponds to another layer in the network. However, note that we do not need to specify the number of
iterations. Instead, we can simply create the layers one-by-one, each time attempting to construct an output
layer on top of the existing nodes. We then check the performance of the resulting network on a validation
set, and stop once we reach satisfactory performance. See Figure 1 for details.

Remark 2 (Flexibility of loss function). Compared to our algorithm, many standard deep learning
algorithms are more constrained in terms of the loss function, especially those that directly attempt to minimize
training error. Since these algorithms solve hard, non-convex problems, it is important that the loss will
be as "nice" and smooth as possible, and they often focus on the squared loss (for example, the famous
backpropagation algorithm [21] is tailored for this loss). In contrast, our algorithm can easily work with
any convex loss.



BuildBasis$^1$($\tilde{F}^1$) - example
    Compute SVD: $\tilde{F}^1 = L D W^\top$
    Delete columns $W_i$ where $D_{i,i} = 0$
    $B := \tilde{F}^1 W$
    For $i = 1, \ldots, |W|$
        $b := \sqrt{m} / \|B_i\|$; $B_i := b B_i$; $W_i := b W_i$
    Return $(B, W)$

BuildBasis$^t$($F$, $\tilde{F}^t$) - example
    Initialize $F^t := [\,]$, $W := 0$
    Compute orthonormal basis $O^F$ of $F$'s columns
        // Computed from previous call to BuildBasis$^t$,
        // or directly via QR or SVD decomposition of $F$
    For $r = 1, \ldots, |\tilde{F}^t|$
        $\mathbf{c} := \tilde{F}^t_r - O^F (O^F)^\top \tilde{F}^t_r$
        If $\|\mathbf{c}\| >$ tol
            $F^t := \big[ F^t \ \ \tfrac{\sqrt{m}}{\|\tilde{F}^t_r\|} \tilde{F}^t_r \big]$
            $W_{i(r),j(r)} := \sqrt{m} / \|\tilde{F}^t_r\|$
                // $i(r), j(r)$ are those for which $\tilde{F}^t_r = F^{t-1}_{i(r)} \circ F^1_{j(r)}$
            $O^F := \big[ O^F \ \ \mathbf{c} / \|\mathbf{c}\| \big]$
    Return $(F^t, W)$

Figure 3: Example implementations of the BuildBasis$^1$ and BuildBasis$^t$ procedures.
BuildBasis$^1$ is implemented to return an orthogonal basis for $\tilde{F}^1$'s columns via SVD, while
BuildBasis$^t$ uses a Gram-Schmidt procedure to find an appropriate column subset of $\tilde{F}^t$, which
together with $F$ forms a basis for $[F\ \tilde{F}^t]$'s columns. In the pseudo-code, tol is a tolerance parameter (e.g.
machine precision).
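The Gram-Schmidt column selection of BuildBasis$^t$ can be sketched in NumPy roughly as follows. This is an illustration under our own assumptions rather than a reference implementation: the function name, the default tolerance, and the list-of-triples return format (in place of a sparse matrix $W$) are choices made here, candidates are generated on the fly instead of storing $\tilde{F}^t$, and a single classical Gram-Schmidt pass is used, which may be numerically fragile on ill-conditioned data.

```python
import numpy as np

def build_basis_t(F, F_prev, F1, tol=1e-8):
    """Select scaled products F_prev[:, i] * F1[:, j] that enlarge the span of F.

    Returns (Ft, nodes), where nodes[r] = (i, j, w) describes the node
    computing w * n_prev_i(x) * n1_j(x). Assumes F has independent columns.
    """
    m = F.shape[0]
    Q, _ = np.linalg.qr(F)                        # orthonormal basis of F's columns
    new_cols, nodes = [], []
    for i in range(F_prev.shape[1]):
        for j in range(F1.shape[1]):
            cand = F_prev[:, i] * F1[:, j]        # one column of the candidate matrix
            c = cand - Q @ (Q.T @ cand)           # residual after projecting onto span(Q)
            if np.linalg.norm(c) > tol:
                w = np.sqrt(m) / np.linalg.norm(cand)   # second moment of w*cand is 1
                new_cols.append(w * cand)
                nodes.append((i, j, w))
                Q = np.hstack([Q, (c / np.linalg.norm(c))[:, None]])
    Ft = np.column_stack(new_cols) if new_cols else np.empty((m, 0))
    return Ft, nodes
```

Each accepted column depends on exactly two existing nodes, so the resulting layer stays sparse; an empty `Ft` signals that the span can no longer grow and the construction can stop.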



Remark 3 (Choice of Architecture). In the intermediate layers we proposed constructing a basis for the
columns of $[F\ \tilde{F}^t]$ by using the columns of $F$ and a column subset of $\tilde{F}^t$. However, this is not the only way to
construct a basis. For example, one can try and find a full linear transformation $W^t$ so that the columns of
$[F\ \tilde{F}^t] W^t$ form an orthogonal basis to $[F\ \tilde{F}^t]$. However, our approach combines two important advantages.
On one hand, it creates a network with few connections where most nodes depend on the inputs of only
two other nodes. This makes the network very fast at test-time, as well as better-generalizing in theory and
in practice (see Sec. 4 and Sec. 6 for more details). On the other hand, it is still sufficiently expressive
to compactly represent high-dimensional polynomials, in a product-of-sums form, whose expansion as an
explicit sum of monomials would be prohibitively large. In particular, our network computes functions of the
form $\mathbf{x} \mapsto \sum_j a_j \prod_i \big( b_{i,j} + \langle \mathbf{w}_{i,j}, \mathbf{x} \rangle \big)$, which involve exponentially many monomials. The ability to compactly
represent complex concepts is a major principle in deep learning [4]. This is also why we chose to use a
linear transformation in the first layer - if all non-output layers just compute the product of two outputs from
the previous layers, then the resulting predictor is limited to computing polynomials with a small number of
monomials.

Remark 4 (Connection to Algebraic Geometry). Our algorithm has some deep connections to algebraic
geometry and interpolation theory. In particular, the problem of finding a basis for polynomial functions
on a given set has been well studied in these areas for many years. However, most methods we are aware
of - such as construction of Newton Basis polynomials or multivariate extensions of standard polynomial
interpolation methods [10] - are not computationally efficient, i.e. not polynomial in the dimension $d$ and the
polynomial degree $\Delta$. This is because they are based on explicit handling of monomials, of which there
are $\binom{d+\Delta}{\Delta}$. Efficient algorithms have been proposed for related problems, such as the Buchberger-Möller
algorithm for finding a set of generators for the ideal of polynomials vanishing on a given set (see [1, 7, 18]
and references therein). In a sense, our deep architecture is "orthogonal" to this approach, since we focus
on constructing a basis for polynomials that do not vanish on the set of points. This enables us to find an
efficient, compact representation, using a deep architecture, for getting arbitrary values over a training set.
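To get a feel for how quickly the number of monomials grows, the count $\binom{d+\Delta}{\Delta}$ of monomials of degree at most $\Delta$ in $d$ variables can be evaluated directly. The helper name below is our own; the formula is the standard one quoted in the remark.

```python
from math import comb

def num_monomials(d, degree):
    """Number of monomials of degree at most `degree` in d variables."""
    return comb(d + degree, degree)

# Growth with the degree for a fixed dimension d = 100.
counts = [num_monomials(100, k) for k in range(1, 6)]
```

By contrast, the basis maintained by the algorithm never exceeds $m$ columns, regardless of $d$ and $\Delta$, because it only tracks values on the $m$ training instances.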

3.2 Analysis 

After describing our generic algorithm and its derivation, we now turn to prove its formal properties. In 
particular, we show that its runtime is polynomial in the training set size m and the dimension d, and that it 
can drive the training error all the way to zero. In the next section, we discuss how to make the algorithm 
more practical from a computational and statistical point of view. 

Theorem 1. Given a training set $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$, where $\mathbf{x}_1, \ldots, \mathbf{x}_m$ are distinct points in $\mathbb{R}^d$,
suppose we run the algorithm in Figure 1, constructing a network of total depth $\Delta$. Then:

1. $|F| \le m$, $|F^1| \le d+1$, $\max_t |F^t| \le m$, $\max_t |\tilde{F}^t| \le m(d+1)$.

2. The algorithm terminates after at most $\min\{m-1, \Delta-2\}$ iterations of the For loop.

3. Assuming (for simplicity) $d \le m$, the algorithm can be implemented using at most $O(m^2)$ memory
and $O(dm^4)$ time, plus the polynomial time required to solve the convex optimization problem when
computing the output layer.

4. The network constructed by the algorithm has at most $\min\{m+1, \Delta\}$ layers, width at most $m$, and
total number of nodes at most $m+1$. The total number of arithmetic operations (sums and products)
performed to compute an output is $O(m + d^2)$.

5. At the end of iteration $t$, $F$'s columns span all values attainable by polynomials of degree $\le t$ on the
training instances.

6. The training error of the network created by the algorithm is monotonically decreasing in $\Delta$.
Moreover, if there exists some vector of prediction values $\mathbf{v}$ such that $\ell(\mathbf{v}, \mathbf{y}) = 0$, then after at most $m$
iterations, the training error will be 0.

In item 6, we note that the assumption on $\ell$ is merely to simplify the presentation. A more precise
statement would be that we can get the training error arbitrarily close to $\inf_{\mathbf{v}} \ell(\mathbf{v}, \mathbf{y})$ - see the proof for
details.

Proof. The theorem is mostly an easy corollary of the derivation. 

As to item 1, since we maintain the $m$-dimensional columns of $F$ and each $F^t$ to be linearly independent,
there cannot be more than $m$ of them. The bound on $|F^1|$ follows by construction (as we orthogonalize a
matrix with $d+1$ columns), and the bound on $|\tilde{F}^t|$ now follows by definition of $\tilde{F}^t$.

As to item 2, the algorithm always augments $F$ by $F^t$, and breaks whenever $|F^t| = 0$. Since $F$ can
have at most $m$ columns, it follows the algorithm cannot run more than $m$ iterations. The algorithm also
terminates after at most $\Delta - 2$ iterations, by definition.

As to item 3, the memory bound follows from the bounds on the sizes of $F$, $F^t$, and the associated sizes
of the constructed network. Note that $\tilde{F}^t$ can require as much as $O(dm^2)$ memory, but we don't need to
store it explicitly - any entry in $\tilde{F}^t$ is specified as a product of two entries in $F^1$ and $F^{t-1}$, which can be
found and computed on-the-fly in $O(1)$ time. As to the time bound, each iteration of our algorithm involves
computations polynomial in $m, d$, with the dominant factors being BuildBasis$^t$ and BuildBasis$^1$.
The time bounds follow from the implementations proposed in Figure 3, using the upper bounds on the
sizes of the relevant matrices, and the assumption that $d \le m$.

As to item 4, it follows from the fact that in each iteration, we create layer $t$ with at most $|F^t|$ new nodes,
and there are at most $\min\{m-1, \Delta-2\}$ iterations/layers excluding the input and output layers. Moreover,
each node in our network (except the output node) corresponds to a column in $F$, so there are at most
$m$ nodes plus the output node. Finally, the network computes a linear transformation in $\mathbb{R}^d$, then at most
$m$ nodes perform 2 products each, and a final output node computes a weighted linear combination of the
output of all other nodes (at most $m$) - so the number of operations is $O(m + d^2)$.

As to item 5, it follows immediately by the derivation presented earlier.

Finally, we need to show item 6. Recall from the derivation that in the output layer, we use the linear
weights $w$ which minimize $\ell(Fw, y)$. If we increase the depth of our constructed network, what happens is
that we augment $F$ by more and more linearly independent columns, the initial columns being exactly the
same. Thus, the size of the set of prediction vectors $\{Fw : w \in \mathbb{R}^{|F|}\}$ only increases, and the training error
can only go down.

If we run the algorithm till $|F| = m$, then the columns of $F$ span $\mathbb{R}^m$, since the columns of $F$ are
linearly independent. Hence $\{Fw : w \in \mathbb{R}^{|F|}\} = \mathbb{R}^m$. This implies that we can always find $w$ such that
$Fw = v$, where $\ell(v, y) = 0$, so the training error is zero. The only case left to treat is if the algorithm
stops when $|F| < m$. However, we claim this can't happen. This can only happen if $|F^t| = 0$ after the
basis construction process, namely that $F$'s columns already span the columns of $\tilde{F}^t$. However, this would
imply that we can span the values of all degree-$t$ polynomials on the training instances, using polynomials
of degree $\le t-1$. But using Lemma 2, it would imply that we could write the values of every degree-$(t+1)$
polynomial using a linear combination of polynomials of degree $\le t-1$. Repeating this, we get that the
values of polynomials of degree $t+2, t+3, \ldots$ are all spanned by polynomials of degree $\le t-1$. However, the
values of all polynomials of any degree over $m$ distinct points must span $\mathbb{R}^m$, so we must have $|F| = m$. □

An immediate corollary of this result is the following:

Remark 5 (The Basis Learner is a Universal Learner). Our algorithm is a universal algorithm, in the sense 
that as we run it for more and more iterations, the training error provably decreases, eventually hitting zero. 
Thus, we can get an entire curve of solutions, trading-off between the training error on one hand and the 
size of the resulting network (as well as the potential of overfitting) on the other hand. Furthermore, since 
any target predictor can be approximated by a polynomial of some degree t, it follows that with a sufficiently 
large training set, after performing t iterations of our algorithm we are guaranteed to learn a competitive 
predictor. 
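
As a rough illustration of this monotone behavior, the following pure-Python sketch mimics the idealized construction on a toy one-dimensional dataset, with the squared loss standing in for the generic loss. The helper names (`dot`, `residual`, `extend`, `basis_learner_errors`) and the toy data are assumptions made for this example and are not taken from the published implementation; Gram-Schmidt is used here as one possible way to keep the basis linearly independent.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(v, basis):
    """Subtract from v its projection onto an orthonormal list of vectors."""
    r = list(v)
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r

def extend(basis, candidates, tol=1e-9):
    """Return orthonormal vectors spanning what the candidates add to basis."""
    added = []
    for cand in candidates:
        r = residual(cand, basis + added)
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip candidates that are (numerically) already spanned
            added.append([ri / norm for ri in r])
    return added

def basis_learner_errors(xs, y, max_iters):
    """Squared training error of the best linear output layer after each layer."""
    ones = [1.0] * len(xs)
    columns = [list(col) for col in zip(*xs)]
    first = extend([], [ones] + columns)        # first layer: degree <= 1
    basis, prev = list(first), list(first)
    r = residual(y, basis)
    errors = [dot(r, r)]
    for _ in range(max_iters):
        # candidate columns: entry-wise products of the last layer with the first
        products = [[a * b for a, b in zip(p, q)] for p in prev for q in first]
        new = extend(basis, products)
        if not new:                             # nothing new can be spanned: stop
            break
        basis, prev = basis + new, new
        r = residual(y, basis)
        errors.append(dot(r, r))
    return errors

# Toy data (hypothetical): 4 distinct points on the line, alternating labels.
errors = basis_learner_errors([[0.0], [1.0], [2.0], [3.0]], [1.0, -1.0, 1.0, -1.0], 5)
```

On this toy input the recorded errors never increase from one layer to the next, and they reach numerical zero once the basis has $m = 4$ columns.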

3.3 Making the Algorithm Practical 

While the algorithm we presented runs in provable polynomial time, it has some important limitations. In 
particular, while we can always control the depth of the network by early stopping, we do not control its 
width (i.e. the number of nodes created in each layer). In the worst case, it can be as large as the number of 
training instances m. This has two drawbacks: 

• The algorithm can only be used for small datasets - when $m$ is large, we might get huge networks,
and running the algorithm will be computationally prohibitive, involving manipulations of matrices
of order $m \times md$.

• Even ignoring computational constraints, the huge network which might be created is likely to overfit. 

To tackle this, we propose a simple modification of our scheme, where the network width is explicitly
constrained at each iteration. Recall that the width of a layer constructed at iteration $t$ is equal to the number
of columns in $F^t$. Till now, $F^t$ was such that the columns of $[F\ F^t]$ span the column space of $[F\ \tilde{F}^t]$. So if
$|\tilde{F}^t|$ is large, $|F^t|$ might be large as well, resulting in a wide layer with many new nodes. However, we can
give up on exactly spanning $\tilde{F}^t$, and instead seek to "approximately span" it, using a smaller partial basis of
bounded size $\gamma$, resulting in a layer of width $\gamma$.

The next natural question is how to choose this partial basis. There are several possible criteria, both
supervised and unsupervised. We will focus on the following choice, which we found to be quite effective
in practice:

• The first layer computes a linear transformation which transforms the augmented data matrix $[1\ X]$
into its first $\gamma$ leading singular vectors (this is closely akin - although not identical - to Principal
Component Analysis (PCA) - see Footnote 1).

• The next layers use a standard Orthogonal Least Squares procedure @ to greedily pick the columns
of $\tilde{F}^t$ which seem most relevant for prediction. The intuition is that we wish to quickly decrease
the training error, using a small number of new nodes and in a computationally cheap way. Specifically,
for binary classification and regression, we consider the vector $y = (y_1, \ldots, y_m)$ of training
labels/target values, and iteratively pick the column of $\tilde{F}^t$ whose residual (after projecting on the
existing basis $F$) is most correlated with the residual of $y$ (again, after projecting on the existing basis
$F$). The column is then added to the existing basis, and the process repeats itself. A simple extension
of this idea can be applied to the multiclass case. Finally, to speed up the computation, we can process
the columns of $\tilde{F}^t$ in mini-batches, where each time we find and add the $b$ ($b \ge 1$) most correlated
vectors before iterating.
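
The greedy selection rule described in the second bullet can be sketched in a few lines of pure Python. This is only an illustrative sketch for the single-output case with batch size 1: the function names (`project_out`, `ols_select`), the tolerance, and the toy vectors are assumptions made here, and the existing basis is assumed to be given as orthonormal vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, q):
    """Remove from v its component along the unit vector q."""
    c = dot(v, q)
    return [a - c * b for a, b in zip(v, q)]

def ols_select(basis, candidates, y, k, tol=1e-9):
    """Greedily pick up to k candidate indices, most correlated residual first.

    basis: orthonormal vectors spanning the columns chosen so far.
    tol is used both as a zero-norm and as a zero-correlation threshold.
    """
    res_y = list(y)
    for q in basis:
        res_y = project_out(res_y, q)
    res_c = []
    for cand in candidates:
        r = list(cand)
        for q in basis:
            r = project_out(r, q)
        res_c.append(r)
    chosen = []
    for _ in range(k):
        best, best_score = None, tol
        for i, r in enumerate(res_c):
            norm = math.sqrt(dot(r, r))
            if i in chosen or norm <= tol:
                continue                      # already used, or already spanned
            score = abs(dot(r, res_y)) / norm  # correlation with the label residual
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break                             # no candidate helps any further
        chosen.append(best)
        norm = math.sqrt(dot(res_c[best], res_c[best]))
        q = [a / norm for a in res_c[best]]
        res_y = project_out(res_y, q)         # update residuals for the next pick
        res_c = [project_out(r, q) for r in res_c]
    return chosen

# Toy example (hypothetical data): the basis holds the constant direction only.
basis = [[0.5, 0.5, 0.5, 0.5]]
candidates = [[0.0, 1.0, 2.0, 3.0], [1.0, -1.0, 1.0, -1.0], [1.0, 0.0, 0.0, 1.0]]
picked = ols_select(basis, candidates, [3.0, -1.0, 3.0, -1.0], k=2)
```

Here the second candidate explains the whole label residual, so it is picked first and the procedure stops early because no remaining candidate is correlated with what is left.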

BuildBasis$^1$($\tilde{F}^1$) - Width-Limited Variant

Parameter: Layer width $\gamma > 0$
Compute SVD: $\tilde{F}^1 = L D W^\top$
$W := [W_1\ W_2\ \cdots\ W_\gamma]$
    // Assumed to be the columns corresponding to the $\gamma$ largest non-zero singular values
$B := \tilde{F}^1 W$
For $i = 1, \ldots, |W|$
    $b := \sqrt{m}/\|B_i\|$;  $B_i := b B_i$;  $W_i := b W_i$
Return $(B, W)$


BuildBasis$^t$($F$, $\tilde{F}^t$) - Width-Limited Variant

Parameter: Layer width $\gamma > 0$, batch size $b$
Let $V$ denote the target value vector/matrix (see caption)
Initialize $F^t := [\,]$, $W := 0$
Compute orthonormal basis $O^F$ of $F$'s columns
    // Computed in the previous call to BuildBasis$^t$,
    // or directly via QR or SVD decomposition of $F$
$V := V - O^F (O^F)^\top V$
For $r = 1, 2, \ldots, \gamma/b$
    $C := \tilde{F}^t - O^F (O^F)^\top \tilde{F}^t$
    $C_i := C_i / \|C_i\|$ for all $i = 1, \ldots, |C|$
    Compute orthonormal basis $O^V$ of $V$'s columns
    Let $i(1), \ldots, i(b)$ be indices of the $b$ linearly independent columns
        of $(O^V)^\top C$ with largest positive norm
    For $r = i(1), i(2), \ldots, i(b)$:
        $F^t := [F^t \;\; (\sqrt{m}/\|\tilde{F}^t_r\|)\, \tilde{F}^t_r]$
        $W_{i(r), j(r)} := \sqrt{m}/\|\tilde{F}^t_r\|$
        // $i(r), j(r)$ are those for which $\tilde{F}^t_r = F^{t-1}_{i(r)} \circ F^1_{j(r)}$
    Compute orthonormal basis $O^C$ of the columns of $[C_{i(1)}\ C_{i(2)}\ \cdots\ C_{i(b)}]$
    $O^F := [O^F\ O^C]$
    $V := V - O^C (O^C)^\top V$
Return $(F^t, W)$

Figure 4: Practical width-limited implementations of the BuildBasis$^1$ and BuildBasis$^t$ procedures.
BuildBasis$^1$ is implemented to return an orthogonal partial basis for $\tilde{F}^1$'s columns, which spans the largest
singular vectors of the data. BuildBasis$^t$ uses a supervised OLS procedure in order to pick a partial basis
for $[F\ \tilde{F}^t]$, which is most useful for prediction. In the code, $V$ represents the vector of training set
labels $(y_1, \ldots, y_m)$ for binary classification and regression, and the indicator matrix $V_{i,j} = \mathrm{Ind}(y_i = j)$ for
multiclass prediction. For simplicity, we assume the batch size $b$ divides the layer width $\gamma$.

These procedures are implemented via the subroutines BuildBasis$^1$ and BuildBasis$^t$, whereas the
main algorithm (Figure 1) remains unaffected. A precise pseudo-code appears in Figure 4. We note that in
a practical implementation of the pseudo-code, we do not need to explicitly compute the potentially large
matrices $C$, $\tilde{F}^t$ - we can simply compute each column and its associated correlation score one-by-one, and
use the list of scores to pick and re-generate the most correlated columns.
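
One way to read this remark is that the candidate columns can be streamed: each product column is formed, scored, and discarded, and only the index pairs of the best-scoring ones are kept. The sketch below illustrates this for a single target vector. The function name `top_product_columns`, its arguments, and the tolerance are assumptions made for this example, and unlike Figure 4 it does not check that the $b$ selected columns are linearly independent of each other.

```python
import heapq
import math

def top_product_columns(prev_layer, first_layer, basis, res_y, b, tol=1e-9):
    """Return the (i, j) pairs of the b best-scoring columns prev[i] * first[j].

    basis: orthonormal vectors spanning the current F.
    res_y: residual of the targets after projecting on that basis.
    No candidate column is stored; each one is scored and then dropped.
    """
    def scored():
        for i, p in enumerate(prev_layer):
            for j, q in enumerate(first_layer):
                col = [a * c for a, c in zip(p, q)]          # entry-wise product
                for o in basis:                              # residual w.r.t. F
                    coef = sum(x * z for x, z in zip(col, o))
                    col = [x - coef * z for x, z in zip(col, o)]
                norm = math.sqrt(sum(x * x for x in col))
                if norm > tol:
                    corr = abs(sum(x * z for x, z in zip(col, res_y))) / norm
                    yield corr, i, j
    return [(i, j) for _, i, j in heapq.nlargest(b, scored())]

# Toy example (hypothetical data): layer columns are 1 and the centered input x.
x_centered = [-1.5, -0.5, 0.5, 1.5]
layer = [[1.0] * 4, x_centered]
s = math.sqrt(5.0)
basis = [[0.5] * 4, [v / s for v in x_centered]]
picked = top_product_columns(layer, layer, basis, [1.0, -1.0, -1.0, 1.0], b=2)
```

The selected index pairs are enough to re-generate the chosen columns later, which is what keeps the memory footprint independent of the number of candidate columns.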

We now turn to discuss the theoretical properties of this width-constrained variant of our algorithm.
Recall that in its idealized version, the Basis Learner is guaranteed to eventually decrease the training error
to zero in all cases. However, with a width constraint, there are adversarial cases where the algorithm will
get "stuck" and will terminate before the training error gets to zero. This may happen when $|F| < m$, and
all the columns of $\tilde{F}^t$ are spanned by $F$, so no new linearly independent vectors can be added to $F$, $|F^t|$
will be zero, and the algorithm will terminate (see Figure 1). However, we have never witnessed this happen
in any of our experiments, and we can prove that this is indeed the case as long as the input instances are in
"general position" (which we shortly formalize). Thus, we get a completely analogous result to Thm. 1, for
the more practical variant of the Basis Learner.

Intuitively, the general position condition we require implies that if we take any two columns $F_i, F_j$ in
$F$, and $|F| < m$, then the product vector $F_i \circ F_j$ is linearly independent from the columns of $F$. This is
intuitively plausible, since the entry-wise product $\circ$ is a highly non-linear operation, so in general there is
no reason that $F_i \circ F_j$ will happen to lie exactly in the subspace spanned by $F$'s columns. More formally,
we use the following:

Definition 1. Let $x_1, \ldots, x_m$ be a set of distinct points in $\mathbb{R}^d$. We say that $x_1, \ldots, x_m$ are in M-general
position if for every $m$ monomials, $g_1, \ldots, g_m$, the $m \times m$ matrix $M$ defined as $M_{i,j} = g_j(x_i)$ has rank $m$.
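
To make the definition concrete, the following sketch builds the matrix $M$ for one given family of monomials and computes its rank exactly with rational arithmetic. Since the definition quantifies over every choice of $m$ monomials, such a check can only refute M-general position, or confirm it for the particular family tested. The function names and the exponent-tuple encoding of monomials are assumptions made for this example.

```python
from fractions import Fraction

def monomial_matrix(points, exponents):
    """M[i][j] = g_j(x_i), where g_j is the monomial with exponent tuple exponents[j]."""
    matrix = []
    for x in points:
        row = []
        for e in exponents:
            value = Fraction(1)
            for xi, ei in zip(x, e):
                value *= Fraction(xi) ** ei
            row.append(value)
        matrix.append(row)
    return matrix

def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

# Points 1, 2, 3 with monomials 1, x, x^2 give a full-rank Vandermonde matrix.
vandermonde = monomial_matrix([(1,), (2,), (3,)], [(0,), (1,), (2,)])
# Points 1, -1 with monomials 1, x^2 give two identical rows, so the rank drops.
degenerate = monomial_matrix([(1,), (-1,)], [(0,), (2,)])
```

The second example shows how symmetric inputs can violate the condition for a specific pair of monomials, which is the kind of degenerate configuration the definition rules out.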

The following theorem is analogous to Thm. 1. The only difference is in item 5, in which we use the
M-general position assumption.

Theorem 2. Given a training set $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_1, \ldots, x_m$ are distinct points in $\mathbb{R}^d$,
suppose we run the algorithm in Figure 1 with the subroutines implemented in Figure 4, using a uniform value
for the width $\gamma$ and batch size $b$, constructing a network of depth $\Delta$. Then:

1. $|F| \le \gamma\Delta$, $\max_t |F^t| \le \gamma$, $\max_t |\tilde{F}^t| \le \gamma^2$.

2. Assume (for simplicity) that $d \le m$, and the case of regression or classification with a constant
number of classes. Then the algorithm can be implemented using at most $O(m(d + \gamma\Delta))$ memory
and $O(\Delta m(\gamma b + \Delta\gamma^4/b))$ time, plus the polynomial time required to solve the convex optimization
problem when computing the output layer, and the SVD in BuildBasis$^1$ (see remark below).

3. The network constructed by the algorithm has at most $\Delta$ layers, with at most $\gamma$ nodes in each layer.
The total number of nodes is at most $\min\{m, (\Delta-1)\gamma\} + 1$. The total number of arithmetic operations
(sums and products) performed to compute an output is $O(\gamma(d + \Delta))$.

4. The training error of the network created by the algorithm is monotonically decreasing in $\Delta$.

5. If the rows of the matrix $B$ returned by the width-limited variant of BuildBasis$^1$ are in M-general
position, $\Delta$ is unconstrained, and there exists some vector of prediction values $v$ such that $\ell(v, y) = 0$,
then after at most $m$ iterations, the training error will be 0.

Proof. The proof of the theorem, except part 5, is a simple adaptation of the proof of Thm. 1, using our
construction and the remarks we made earlier. So, it is only left to prove part 5. The algorithm will terminate
before driving the error to zero if at some iteration we have that the columns of $\tilde{F}^t$ are spanned by $F$ and
$|F| < m$. But, by construction, this implies that there are $|F| < m$ monomials such that if we apply them
on the rows of $B$, we obtain linearly dependent vectors. This contradicts the assumption that the rows of $B$
are in M-general position and concludes our proof. □

We note that in item 2, the SVD mentioned is over an $m \times (d+1)$ matrix, which requires $O(md^2)$ time
to perform exactly. However, one can use randomized approximate SVD procedures (e.g. [12]) to perform
the computation in $O(md\gamma)$ time. While not exact, these approximate methods are known to perform very
well in practice, and in our experiments we observed no significant degradation by using them in lieu of
exact SVD. Overall, for fixed $\Delta, \gamma$, this allows our Basis Learner algorithm to construct the network in time
linear in the data size.

Overall, compared to Thm. 1, we see that our more practical variant significantly lowers the memory
and time requirements (assuming $\Delta, \gamma$ are small compared to $m$), and we still have the property that the
training error decreases monotonically with the network depth, and reduces to zero under mild conditions
that are likely to hold on natural datasets.
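
To see where the $O(\gamma(d + \Delta))$ operation count comes from, it may help to look at how a prediction is computed once the network has been built. The sketch below assumes a particular encoding that is not taken from the paper: the first layer is a list of weight rows over the augmented input $[1\ x]$, each later node is a triple `(i, j, w)` meaning $w$ times node $i$ of the previous layer times node $j$ of the first layer, and the output node holds one weight per node.

```python
def forward(x, first_weights, layers, out_weights):
    """Evaluate a polynomial network on one input vector x.

    first_weights: rows of length d + 1 (bias weight first), one per first-layer node.
    layers: list of layers; each node is (i, j, w) = w * prev[i] * first[j].
    out_weights: one weight per node, ordered layer by layer.
    """
    augmented = [1.0] + list(x)
    # First layer: about d operations per node.
    first = [sum(w * a for w, a in zip(row, augmented)) for row in first_weights]
    outputs = list(first)
    prev = first
    for layer in layers:
        # Each later node costs 2 products, regardless of the layer width.
        current = [w * prev[i] * first[j] for (i, j, w) in layer]
        outputs += current
        prev = current
    # Output node: a weighted linear combination of all nodes.
    return sum(w * o for w, o in zip(out_weights, outputs))

# Toy network (hypothetical weights): nodes compute x, x^2 and x^3; output sums them.
value = forward([2.0], [[0.0, 1.0]], [[(0, 0, 1.0)], [(0, 0, 1.0)]], [1.0, 1.0, 1.0])
```

With at most $\gamma$ nodes per layer, the first layer costs on the order of $\gamma d$ operations and each of the remaining layers on the order of $\gamma$, which matches the bound in item 3.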

Before continuing, we again emphasize that our approach is quite generic, and that the criteria we
presented in this section, to pick a partial basis at each iteration, are by no means the only ones possible. For
example, one can use other greedy selection procedures to pick the best columns in BuildBasis$^t$, as well
as unsupervised methods. Similarly, one can use supervised methods to construct the first layer. Also, the
width of different layers may differ. However, our goal here is not to propose the most sophisticated and best- 
performing method, but rather demonstrate that using our approach, even with very simple regularization 
and greedy construction methods, can have good theoretical guarantees and work well experimentally. Of 
course, much work remains in trying out other methods. 

4 Sample Complexity 

So far, we have focused on how the network we build reduces the training error. However, in a learning 
context, what we are actually interested in is getting good generalization error, namely good prediction in 
expectation over the distribution from which our training data was sampled. 

We can view our algorithm as a procedure which given training data, picks a network of width $\gamma$ and
depth $\Delta$. When we use this network for binary classification (e.g. by taking the sign of the output to be the
predicted label), a relevant measure of generalization performance is the VC-dimension of the class of such
networks. Luckily, the VC-dimension of neural networks is a well-studied topic. In particular, by Theorem
8.4 in [2], we know that any binary function class in Euclidean space, which is parameterized by at most
$n$ parameters and each function can be specified using at most $t$ addition, multiplication, and comparison
operations, has VC dimension at most $O(nt)$. Our network can be specified in this manner, using at most
$O(\gamma(d + \Delta))$ operations and parameters (see Thm. 2). This immediately implies a VC dimension bound,
which ensures generalization if the training data size is sufficiently large compared to the network size. We
note that this bound is very generic and rather coarse - we suspect that it can be substantially improved in
our case. However, qualitatively speaking, it tells us that reducing the number of parameters in our network
reduces overfitting. This principle is used in our network architecture, where each node in the intermediate
layers is connected to just 2 other nodes, rather than (say) all nodes in the previous layer.

As an interesting comparison, note that our network essentially computes a $\Delta$-degree polynomial, yet
the VC dimension of all $\Delta$-degree polynomials in $\mathbb{R}^d$ is $\binom{d+\Delta}{\Delta}$, which grows very fast with $d$ and $\Delta$ [3].
This shows that our algorithm can indeed generalize better than directly learning high-degree polynomials,
which is essentially intractable both statistically and computationally.
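
A quick numerical comparison gives a feel for the two growth rates. The sketch below evaluates $\gamma(d + \Delta)$, the order of the parameter and operation count from Thm. 2, next to the number of monomials of degree at most $\Delta$ in $d$ variables. The constants hidden in the $O(\cdot)$ notation are ignored, so the first quantity is only a proxy, and the values $\gamma = 200$ and $d = 784$ are picked here purely for illustration.

```python
from math import comb

def network_size_proxy(gamma, d, depth):
    """gamma * (d + depth): order of the parameter/operation count in Thm. 2."""
    return gamma * (d + depth)

def polynomial_dimension(d, depth):
    """Number of monomials of degree <= depth in d variables."""
    return comb(d + depth, depth)

for depth in (2, 4, 6):
    size = network_size_proxy(200, 784, depth)
    print(depth, size, size ** 2, polynomial_dimension(784, depth))
```

The network proxy grows linearly with the depth, whereas the monomial count grows combinatorially. Even the square of the proxy (the order of the coarse $O(nt)$ bound) is overtaken by the monomial count at depth 6 for these values.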

It is also possible to prove bounds on scale-sensitive measures of generalization (which are relevant if
we care about the prediction values rather than just their sign, e.g. for regression). For example, it is well-known
that the expected squared loss can be related to the empirical squared loss over the training data,
given a bound on the fat-shattering dimension of the class of functions we are learning [2]. Combining
Theorems 11.13 and 14.1 from [2], it is known that for a class of networks such as those we are learning,
the fat-shattering dimension is upper-bounded by the VC dimension of a slightly larger class of networks,
which have an additional real input and an additional output node computing a linear threshold function in
$\mathbb{R}^2$. Such a class of networks has a similar VC dimension to our original class, hence we can effectively
bound the fat-shattering dimension as well.

5 Relation to Kernel Learning 



Kernel learning (see e.g. [22]) has enjoyed immense popularity over the past 15 years, as an efficient and
principled way to learn complex, non-linear predictors. A kernel predictor is of the form $\sum_i \alpha_i k(x_i, \cdot)$,
where $x_1, \ldots, x_m$ are the training instances, and $k(\cdot, \cdot)$ is a kernel function, which efficiently computes an
inner product $\langle \Psi(\cdot), \Psi(\cdot) \rangle$ in a high or infinite-dimensional Hilbert space, to which data is mapped implicitly
via the feature mapping $\Psi$. In this section, we discuss some of the interesting relationships between our work
and kernel learning.

In kernel learning, a common kernel choice is the polynomial kernel, $k(x, x') = (1 + \langle x, x' \rangle)^\Delta$. It is
easy to see that predictors defined via the polynomial kernel correspond to polynomial functions of degree
$\Delta$. Moreover, if the Gram matrix (defined as $G_{i,j} = k(x_i, x_j)$) is full-rank, any values on the training data
can be realized by a kernel predictor: For a desired vector of values $y$, simply find the coefficient vector $\alpha$
such that $G\alpha = y$, and note that this implies that for any $j$, $\sum_i \alpha_i k(x_i, x_j) = y_j$. Thus, when our algorithm
is run to completion, our polynomial network can represent the same predictor class as kernel predictors with
a polynomial kernel. However, there are some important differences, which make our system potentially
better:

• With polynomial kernels, one always has to manipulate an $m \times m$ matrix, which requires memory
and runtime scaling at least quadratically in $m$. This can be very expensive if $m$ is large, and hinders
the application of kernel learning to large-scale data. This quadratic dependence on $m$ is also true
at test time, where we need to explicitly use our $m$ training examples for prediction. In contrast, the
size of our network can be controlled, and the memory and runtime requirements of our algorithm
are only linear in $m$ (see Thm. 2). If we get good results with a moderately-sized network, we can
train and predict much faster than with kernels. In other words, we get the potential expressiveness
of polynomial kernel predictors, but with the ability to control the training and prediction complexity,
potentially requiring much less time and memory.

• With kernels, one has to specify the degree $\Delta$ of the polynomial kernel in advance before training.
In contrast, in our network, the degree of the resulting polynomial predictor does not have to be
specified in advance - each iteration of our algorithm increases the effective degree, and we stop when
satisfactory performance is obtained.

• Learning with polynomial kernels corresponds to learning a linear combination over the set of polynomials
$\{(1 + \langle x_i, \cdot \rangle)^\Delta\}_{i=1}^m$. In contrast, our network learns (in the output layer) a linear combination
of a different set of polynomials, which is constructed in a different, data-dependent way. Thus, our
algorithm uses a different and incomparable hypothesis class compared to polynomial kernel learning.

• Learning with polynomial kernels can be viewed as a network of a shallow architecture as follows:
Each node in the first layer corresponds to one support vector and applies the function $x \mapsto (1 + \langle x_i, x \rangle)^\Delta$.
Then, the second layer is a linear combination of the outputs of the first layer. In contrast,
we learn a deeper architecture. Some empirical evidence shows that deeper architectures may express
complicated functions more compactly than shallow architectures [4, 9].
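
The Gram-matrix argument and the test-time cost discussed in this list can be illustrated with a small exact computation. The sketch below fits a polynomial-kernel predictor by solving $G\alpha = y$ over the rationals and then evaluates it, which requires touching every training point for each prediction. The function names and the toy data are assumptions made for this example, and the plain Gauss-Jordan solver is chosen for clarity rather than efficiency.

```python
from fractions import Fraction

def poly_kernel(x, z, degree):
    """Polynomial kernel (1 + <x, z>)^degree, computed exactly."""
    return (1 + sum(Fraction(a) * Fraction(b) for a, b in zip(x, z))) ** degree

def solve(matrix, rhs):
    """Gauss-Jordan elimination over the rationals (matrix must be non-singular)."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for c in range(n):
        # Raises StopIteration if no pivot exists, i.e. the matrix is singular.
        p = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for i in range(n):
            if i != c:
                factor = aug[i][c]
                aug[i] = [u - factor * v for u, v in zip(aug[i], aug[c])]
    return [row[n] for row in aug]

def fit_kernel_predictor(train_x, train_y, degree):
    gram = [[poly_kernel(xi, xj, degree) for xj in train_x] for xi in train_x]
    return solve(gram, train_y)

def predict(train_x, alpha, x, degree):
    # Each prediction evaluates the kernel against all m training points.
    return sum(a * poly_kernel(xi, x, degree) for a, xi in zip(alpha, train_x))

# Toy data (hypothetical): three points on the line, degree-2 kernel.
train_x = [(0,), (1,), (2,)]
train_y = [1, -1, 1]
alpha = fit_kernel_predictor(train_x, train_y, degree=2)
fitted = [predict(train_x, alpha, x, degree=2) for x in train_x]
```

Because the Gram matrix is full-rank here, the fitted values reproduce the training targets exactly, while `predict` makes visible that the cost of one prediction grows with the number of training points.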

A related interesting point, which we also mentioned earlier in the paper, is that our network can compactly
represent high-dimensional polynomials. Since the first layer of our network computes a linear transformation,
and the next layers compute products of such terms, we end up with polynomial functions which
are specified by a large number of products of linear terms. This product-of-sums representation allows us
to compactly compute polynomials, whose explicit expansion as a sum of monomials is prohibitively large
(see Eq. (1)). In kernel learning as well, the kernel function compactly represents an inner product in a very
high or infinite-dimensional Hilbert space, for which an explicit representation is intractable.

Finally, we note that there exist previous works which attempt to connect kernels and deep learning, but
in a very different way than ours. In particular, [7] propose a kernel learning algorithm, where the kernel
mimics computations in large random neural networks. However, the resulting predictor is still a kernel
predictor. That paper also proposes a heuristic deep architecture, where kernel-PCA and feature selection
methods are stacked on top of each other. In contrast, we propose a principled method to directly learn a
deep architecture, which does not rely at all on kernel computations.

6 Experiments 

In this section, we present some preliminary experimental results demonstrating the feasibility of our ap- 
proach. We emphasize right at the outset that our goal here is not to show that we beat existing deep learning 
approaches in terms of test error. Instead, our goal is to illustrate that our approach can match their perfor- 
mance on some standard benchmarks. The important point is that our algorithm can successfully train a 
fundamentally deep learning system, using just a couple of parameters, relatively quickly, with no manual 
tuning, and with the authors having no prior experience in deep learning and no domain expertise. In con- 
trast, the reported results for the benchmarks we consider, using existing deep learning techniques, appear 
to have involved many man-hours of painstaking manual tuning and heuristic parameter searches, and were 
performed by experienced researchers specializing in deep learning. 

To study our approach, we used the benchmarks and protocol described in [14] (these datasets and
experimental details are publicly available at
http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007#Downloadable_datasets).
These benchmark datasets were designed to test deep learning systems, and require highly non-linear
predictors. They consist of 8 datasets, where each instance is a 784-dimensional vector, representing
normalized intensity values of a $28 \times 28$ pixel image. These datasets are as follows:

1. MNIST-basic: The well-known MNIST digit recognition dataset (http://yann.lecun.com/exdb/mnist),
where the goal is to identify handwritten digits in the image.

2. MNIST-rotated: Same as MNIST-basic, but with the digits randomly rotated.

3. MNIST-back-image: Same as MNIST-basic, but with patches taken from unrelated real-world
images in the background.

4. MNIST-back-random: Same as MNIST-basic, but with random pixel noise in the background. 

5. MNIST-rotated+back-image: Same as MNIST-back-image, but with the digits randomly 
rotated. 

6. Rectangles: Given an image of a rectangle, determine whether its height is larger than its width. 

7. Rectangles-images: Same as Rectangles, but with patches taken from unrelated real-world 
images in the background. 

8. Convex: Given images of various shapes, determine whether they are convex or not. 

All datasets consist of 12,000 training instances and 50,000 test instances, except for the Rectangles
dataset (1200/50000 train/test instances) and the Convex dataset (8000/50000 train/test instances). We
refer the reader to [14] for more precise details on the construction used.

In [14], for each dataset and algorithm, the last 2000 examples of the training set were split off and used
as a validation set for parameter tuning (except Rectangles, where it was the last 200 examples). The
algorithm was then trained on the entire training set using those parameters, and classification error on the
test set was reported.

The algorithms used in [14] involved several deep learning systems: Two deep belief net algorithms
(DBN-1 and DBN-3), a stacked autoencoder algorithm (SAA-3), and a standard single-hidden-layer,
feed-forward neural network (NNet). Also, experiments were run on Support Vector Machines, using an RBF
kernel (SVM-RBF) and a polynomial kernel (SVM-Poly).

We experimented with the practical variant of our Basis Learner algorithm (as described in Subsection 3.3),
using a simple, publicly-available implementation in MATLAB
(http://www.wisdom.weizmann.ac.il/~shamiro/code/BasisLearner.zip). As mentioned earlier in the text,
we avoided storing $\tilde{F}^t$, instead computing parts of it as the need arose. We followed the same experimental
protocol as above, using the same split of the training set and using the validation set for parameter tuning.
For the output layer, we used stochastic gradient descent to train a linear classifier, using a standard
L2-regularized hinge loss (or the multiclass hinge loss for multiclass classification). In the intermediate
layer construction procedure (BuildBasis$^t$), we fixed the batch size to 50. We tuned the following 3
parameters:

• Network width $\gamma \in \{50, 100, 150, 200, 250, 300\}$

• Network depth $\Delta \in \{2, 3, 4, 5, 6, 7\}$

• Regularization parameter $\lambda \in$ {10~^, 10~^'^, . . . , 10^}

Importantly, we did not need to train a new network for every combination of these values. Instead, for every
value of $\gamma$, we simply built the network one layer at a time, each time training an output layer over the layers
so far (using the different values of $\lambda$), and checking the results on a validation set. We deviated from this
protocol only in the case of the MNIST-basic dataset, where we allowed ourselves to check 4 additional
architectures: The width of the first layer constrained to be 50, and the other layers are of width 100, 200, 400
or 600. The reason for this is that MNIST is known to work well with a PCA preprocessing (where the
data is projected to a few dozen principal components). Since our first layer also performs a similar type
of processing, it seems that a narrow first layer would work well for this dataset, which is indeed what
we've observed in practice. Without trying these few additional architectures, the test classification error for
MNIST-basic is 4.32%, which is about 0.8% worse than what is reported below.
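
The layer-by-layer tuning protocol described above can be sketched as a simple search loop. The four callables passed in (`build_first`, `add_layer`, `fit_output`, `val_error`) are hypothetical placeholders for the corresponding steps of an implementation and are not the names used in the published MATLAB code; the stub versions below exist only so that the loop can be run.

```python
def layerwise_search(widths, max_depth, lambdas, build_first, add_layer,
                     fit_output, val_error):
    """Return (best validation error, (width, depth, lambda)).

    For each width the hidden layers are built once, one at a time; only the
    cheap output layer is re-trained for every regularization value.
    """
    best = (float("inf"), None)
    for gamma in widths:
        layers = build_first(gamma)             # depth 2: first layer + output
        for depth in range(2, max_depth + 1):
            if depth > 2:
                layers = add_layer(layers, gamma)   # reuse all earlier layers
            for lam in lambdas:
                model = fit_output(layers, lam)
                err = val_error(model)
                if err < best[0]:
                    best = (err, (gamma, depth, lam))
    return best

# Stub components (purely illustrative): a "network" is just a list of widths,
# and the made-up validation error is smallest at width 100 and depth 4.
best = layerwise_search(
    widths=[50, 100, 150],
    max_depth=7,
    lambdas=[0.1, 0.01],
    build_first=lambda g: [g],
    add_layer=lambda layers, g: layers + [g],
    fit_output=lambda layers, lam: (tuple(layers), lam),
    val_error=lambda m: abs(len(m[0]) + 1 - 4) + abs(m[0][0] - 100) / 100 + m[1],
)
```

The point of this structure is that the number of layer constructions is proportional to the number of widths times the maximal depth, rather than to the full size of the parameter grid.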

Although it is difficult to compare man-hours, it seems that the results reported in [14] required
considerably more work than ours. In the website companion, the authors describe the sophisticated parameter
search heuristics they used, which involved several stages of hyper-parameter tuning and iteratively refined
grid searches, some of which were apparently performed manually (e.g. "As for the optimization
hyper-parameters, we would proceed by first trying a few combinations of values for the stochastic
gradient descent learning rate of the supervised and unsupervised phases... The first trials would simply
give us a trend on the validation set error for these parameters... and we would then consider that
information in selecting appropriate additional trials"). Moreover, the authors state that at
least for some of the deep learning methods used, a single training of the model required more than a day.
In contrast, our parameter tuning was coarsely discretized, automated, and involved just a few parameters.
We used a simple non-optimized MATLAB implementation, and for each width choice, building a family
of networks for all $\Delta$ and $\gamma$ required at most a couple of hours using a single CPU.

We report the test error results (percentages of misclassified test examples) in the table below. Each
dataset number corresponds to the numbering of the dataset descriptions above. For each dataset, we report
the test error, and in parentheses indicate the depth/width of the network (where depth corresponds to $\Delta$, so
it includes the output layer). For comparison, we also include the test error results reported in [14] for the
other algorithms. Note that most of the MNIST-related datasets correspond to multiclass classification with
10 classes, so any result achieving less than 90% error is non-trivial.

From the results, we see that our algorithm performs quite well, building deep networks of modest size
which are competitive with (and for the Convex dataset, even surpass) the previous reported results.
The only exception is the Rectangles dataset (dataset no. 6), which is artificial and very small, and we
found it hard to avoid overfitting (the training error was zero, even after tuning $\lambda$). However, compared
to the other deep learning approaches, training our networks required minimal human intervention and
modest computational resources. The results are also quite favorable compared to kernel predictors, but
the predictors constructed by our algorithm can be stored and evaluated much faster. Recall that a kernel
SVM generally requires time and memory proportional to the entire training set in order to compute a single
prediction at test time. In contrast, the memory and time requirements of the predictors produced by our
algorithm are generally at least 1-2 orders of magnitude smaller.

Figure 5: Training and Validation Error Curves for the MNIST-Rotated dataset, as a function of trained
network width and depth.

It is also illustrative to consider training/generalization error curves for our algorithm, seeing how
the bias/variance trade-off plays out for different parameter choices. We present results for the
MNIST-rotated dataset, based on the data gathered in the parameter tuning stage (where the algorithm
was trained on the first 10,000 training examples, and tested on a validation set of 2,000 examples). The
results for the other datasets are qualitatively similar. We investigate how 3 quantities behave as a function
of the network depth and width:

• The validation error (for the best choice of regularization parameter $\lambda$ in the output layer)

• The corresponding training error (for the same choice of $\lambda$)

• The lowest training error attained across all choices of $\lambda$

The first quantity shows how well we generalize as a function of the network size, while the third quantity 
shows how expressive is our predictor class. The second quantity is a hybrid, showing how expressive is our 
predictor class when the output layer is regularized to avoid too much overfitting. 
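
For concreteness, the three quantities can be extracted from a table of tuning results as follows. This is a minimal sketch under an assumed data layout, a dictionary mapping `(depth, lam)` to a `(train_error, validation_error)` pair for one fixed width; the function name and the numbers in the example are hypothetical and are not measurements from the experiments.

```python
def error_curves(results):
    """Summarize tuning results for a fixed width.

    results: {(depth, lam): (train_error, validation_error)}
    Returns, per depth, the validation error at the best lambda, the training
    error at that same lambda, and the lowest training error over all lambdas.
    """
    curves = {}
    for depth in sorted({d for d, _ in results}):
        runs = {lam: errs for (d, lam), errs in results.items() if d == depth}
        best_lam = min(runs, key=lambda lam: runs[lam][1])
        curves[depth] = {
            "validation": runs[best_lam][1],
            "train_at_best_lambda": runs[best_lam][0],
            "lowest_train": min(train for train, _ in runs.values()),
        }
    return curves

# Hypothetical numbers, for illustration only.
curves = error_curves({
    (3, 0.1): (0.10, 0.20),
    (3, 0.001): (0.02, 0.25),
    (4, 0.1): (0.05, 0.15),
    (4, 0.001): (0.00, 0.18),
})
```

In this made-up example the lowest training error keeps shrinking with depth even when the regularized training error at the validation-optimal $\lambda$ stays noticeably higher, which is the distinction the second and third quantities are meant to capture.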

The behavior of these quantities is presented graphically in Figure 5. First of all, it's very clear that
this dataset requires a non-linear predictor: For a network depth of 2, the resulting predictor is just a linear
classifier, whose train and test errors are around 50% (off-the-charts). Dramatically better results are
obtained with deeper networks, which correspond to non-linear predictors. The lowest attainable training
error shrinks very quickly, attaining an error of virtually 0 in the larger depths/widths. This accords with our
claim that the Basis Learner algorithm is essentially a universal learning algorithm, able to monotonically
decrease the training error. A similar decreasing trend also occurs in the training error once $\lambda$ is tuned based
on the validation set, but the effect of $\lambda$ is important here and the training errors are not so small. In contrast,
the validation error has a classical unimodal behavior, where the error decreases initially, but as the network
continues to increase in size, overfitting starts to kick in.

Finally, we also performed some other experiments to test some of the decisions we made in implementing 
the Basis Learner approach. In particular: 

• Choosing the intermediate layer's connections to be sparse (each node computes the product of only 
two other nodes) had a crucial effect. For example, we experimented with variants more similar 
in spirit to the VCA algorithm in [18], where the columns of F are forced to be orthogonal. This 
translates to adding a general linear transformation between every two layers. However, the variants 
we tried tended to perform worse and to suffer from overfitting. This may not be surprising, since these 
linear transformations add a large number of additional parameters, greatly increasing the complexity 
of the network and the risk of overfitting. 

• Similarly, performing a linear transformation of the data in the first layer seems to be important. For 
example, we experimented with an alternative algorithm, which builds the first layer in the same way 
as the intermediate layers (using single products), and the results were quite inferior. While more 
experiments are required to explain this, we note that without this linear transformation in the first 
layer, the resulting predictor can only represent polynomials with a modest number of monomials 
(see Remark 3). Moreover, the monomials tend to be very sparse on sparse data. 

• As mentioned earlier, the algorithm still performed well when the exact SVD computation in the first 
layer construction was replaced by an approximate randomized SVD computation (as in [12]). This 
is useful in handling large datasets, where an exact SVD may be computationally expensive. 
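
To illustrate the last point, the following is a minimal sketch of a generic randomized SVD based on a Gaussian range finder with a few power iterations. It is not necessarily the exact variant used in our experiments; the function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are assumptions made for this example.

```python
import numpy as np

def randomized_svd(X, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-`rank` singular triplets of X (generic sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Sketch size: target rank plus a small oversampling margin.
    k = min(rank + oversample, min(m, n))
    # Random Gaussian test matrix; X @ omega samples the range of X.
    omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(X @ omega)
    # Power iterations sharpen the captured subspace; re-orthonormalize
    # after each multiplication for numerical stability.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    # Project X onto the small subspace and take an exact SVD there.
    B = Q.T @ X
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]
```

The exact SVD is only computed on the small k-by-n matrix `B`, so the dominant cost is a handful of matrix products with `X`, which is what makes this attractive when the data matrix is large.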

We end by emphasizing that these experimental results are preliminary, and that much more work 
remains to further study the new learning approach that we introduce here, both theoretically and 
experimentally. 